Text Line Extraction from Complex Layout Documents

نویسندگان

  • Brij Mohan Singh
  • Ankush Mittal
  • Vivek Chand
  • Debashish Ghosh
چکیده

There are numerous stylish documents which do not have the traditional text layouts where printed text regions are not parallel to each other. Such complex layouts make text line extraction challenging due to multi-orientation of paragraphs. This paper introduces a system for the text line extraction from the complex layout documents. Proposed method is based on the concept of dilation and histogram profiling. The text regions are extracted using dilation and food fill based approach, then paragraph orientation is determined and individual text lines are extracted. The accuracy of extracted text lines are evaluated using the new proposed concept that is also based on the histogram profiling. The results of proposed approach on the complex layouts are promising. General Terms Document Analysis and Recognition, Optical Character Recognition.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Information Extraction from Document Images using Attention Based Layout Segmentation

Introduction The attention of a human reader and the reading speed strongly depends on the layout of a document The term layout is used for the geometrical arrangement of document components (i.e. text, graphics and figures) on the page as well as for the typographic features of the text (i.e. font type, style, size, alignment and line spacing). Although the human visual and cognitive perceptio...

متن کامل

Automatic Detection of Font Size Straight from Run Length Compressed Text Documents

Automatic detection of font size finds many applications in the area of intelligent OCRing and document image analysis, which has been traditionally practised over uncompressed documents, although in real life the documents exist in compressed form for efficient storage and transmission. It would be novel and intelligent if the task of font size detection could be carried out directly from the ...

متن کامل

Detection, Extraction and Representation of Tables

We are concerned with the extraction of tables from exchange format representations of very diverse composite documents. We put forward a flexible representation scheme for complex tables, based on a clear distinction between the physical layout of a table and its logical structure. Relying on this scheme, we develop a new method for the detection and the extraction of tables by an analysis of ...

متن کامل

روش جدید متن‌کاوی برای استخراج اطلاعات زمینه کاربر به‌منظور بهبود رتبه‌بندی نتایج موتور جستجو

Today, the importance of text processing and its usages is well known among researchers and students. The amount of textual, documental materials increase day by day. So we need useful ways to save them and retrieve information from these materials. For example, search engines such as Google, Yahoo, Bing and etc. need to read so many web documents and retrieve the most similar ones to the user ...

متن کامل

Natural Language Inspired Approach for Handwritten Text Line Detection in Legacy Documents

Document layout analysis is an important task needed for handwritten text recognition among other applications. Text layout commonly found in handwritten legacy documents is in the form of one or more paragraphs composed of parallel text lines. An approach for handwritten text line detection is presented which uses machinelearning techniques and methods widely used in natural language processin...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011